Reading Assignment 6

NVIDIA Guest Lectures: OpenACC and CUDA Optimizations

Write your answers in a PDF and upload the document to Gradescope for submission. The due date is given on Gradescope. Each question is worth 10 points.

30 NVIDIA guest lecture, OpenACC; Video; Slides

Please watch videos 30 and 31 and review the slide decks before answering these questions.

  1. On slide 39, explain why we see short communication steps in the profiling diagram. Use the code from slide 31. Which variable is being exchanged between the CPU and GPU? Is this data exchange required? Can it be avoided?
  2. Explain the difference between the copy, copyin, and copyout data clauses (see the data-clause sketch after this list).
  3. On slide 54, explain what the directive
    #pragma acc update device(A[0:N])
    does (see the update sketch after this list).

  4. On slide 58, explain what the directive
    #pragma acc kernels loop tile(32, 32)
    does (see the tile sketch after this list). Explain how this could have been used in Homework 4. Which implementation variant (Global, Block, Shared) corresponds (approximately) to the OpenACC tile(32,32) clause?
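
For question 2, the following is a minimal sketch of the three data clauses. The arrays a, b, c, the size N, and the function name are placeholders for illustration, not code from the slides.

    // Illustrative sketch only; not code from the slides.
    #include <stdlib.h>

    void scale_and_add(int N) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 1.0; }

        // copyin: host-to-device at region entry only; copyout: device-to-host
        // at region exit only; copy: transferred in both directions.
        #pragma acc data copyin(a[0:N]) copy(b[0:N]) copyout(c[0:N])
        {
            #pragma acc parallel loop
            for (int i = 0; i < N; i++) {
                b[i] += a[i];        // b is read and written -> copy
                c[i]  = 2.0 * a[i];  // a is only read -> copyin; c is only written -> copyout
            }
        }
        free(a); free(b); free(c);
    }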
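
For question 3, a sketch of the update directive inside a data region; A, N, and init_on_host are hypothetical names, not the slide-54 code.

    // Illustrative sketch only; A, N, and init_on_host are hypothetical.
    void example(double *A, int N, void (*init_on_host)(double *, int)) {
        #pragma acc data create(A[0:N])       // allocate A on the device, no transfer yet
        {
            init_on_host(A, N);                   // the host copy of A changes
            #pragma acc update device(A[0:N])     // refresh the device copy with the host values
            #pragma acc parallel loop
            for (int i = 0; i < N; i++)
                A[i] *= 2.0;
            #pragma acc update self(A[0:N])       // bring the result back to the host
        }
    }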
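
For question 4, a sketch of the tile clause on a doubly nested loop. The transpose-style body and the names A, B, N are assumptions for illustration, not the slide-58 code.

    // Illustrative sketch only; A, B, and N are hypothetical.
    void transpose(float *restrict B, const float *restrict A, int N) {
        // tile(32, 32) splits the i and j loops into 32x32 tiles; the compiler
        // then maps the tile loops and the element loops onto gangs/vectors.
        #pragma acc kernels loop tile(32, 32)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                B[i * N + j] = A[j * N + i];
    }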

31 NVIDIA guest lecture, CUDA optimization; Video; Slides

  1. Use Little's Law to estimate the achieved bandwidth if, on average, there are five people on the escalator. Use the parameters on slide 7 (the general form of the law is restated after this list).
  2. What are the two conditions that must be satisfied for an instruction to become eligible for issue?
  3. For the L2 cache, what is the size of a cache line? What is the size of a sector?
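
For question 1, Little's Law in its general form is restated below for reference; the escalator latency and throughput parameters needed for the actual estimate are on slide 7 and are not reproduced here.

    % Little's Law: the average number of items in flight equals the
    % throughput times the average time each item spends in the system.
    N = \lambda \, W
    % Rearranged for the escalator question:
    % achieved throughput = people on the escalator / time to ride it.
    \lambda = \frac{N}{W}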

Definition: a cache line is a contiguous segment of the cache memory. A directory is used to map each cache line to the corresponding segment of main memory. In a sectored cache, the cache line is subdivided into sectors. When a cache miss occurs (the data is not found in the cache), only the sector containing the referenced data item is transferred from main memory, rather than the entire cache line. This reduces the miss penalty while keeping the cost of maintaining the cache-line directory low.
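
To make the line/sector distinction concrete, here is a small address-decomposition sketch; the 128-byte line and 32-byte sector sizes are illustrative assumptions, not the answer to question 3.

    /* Illustrative sketch only: LINE_SIZE and SECTOR_SIZE are assumed values. */
    #include <stdio.h>
    #include <stdint.h>

    #define LINE_SIZE   128u   /* assumed cache-line size in bytes */
    #define SECTOR_SIZE 32u    /* assumed sector size in bytes     */

    int main(void) {
        uint64_t addr   = 0x1000A4;                         /* example byte address        */
        uint64_t line   = addr / LINE_SIZE;                 /* which cache line it maps to */
        uint64_t sector = (addr % LINE_SIZE) / SECTOR_SIZE; /* which sector in that line   */

        /* On a miss, only SECTOR_SIZE bytes (the referenced sector) are fetched,
           not the full LINE_SIZE-byte line. */
        printf("line %llu, sector %llu\n",
               (unsigned long long)line, (unsigned long long)sector);
        return 0;
    }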